Network information and connected correlations 
Entropy and information provide natural measures of correlation among elements in a network.
We construct here the information theoretic analog of connected correlation functions: irreducible
$N$-point correlation is measured by a decrease in entropy for the joint distribution of $N$ variables
relative to the maximum entropy allowed by all the observed $N-1$ variable distributions. We
calculate the "connected information" terms for several examples, and show that it also enables
the decomposition of the information that is carried by a population of elements about an outside
source.
In statistical physics and field theory, the nature of order
in a system is characterized by correlation functions.
These ideas are especially powerful because there is a 
direct relation between the correlation functions and ex- 
perimental observables such as scattering cross sections 
and susceptibilities. As we move toward the analysis of 
more complex systems, such as the interactions among 
genes or neurons in a network, it is not obvious how to 
construct correlation functions which capture the under- 
lying order. On the other hand it is possible to observe 
directly the activity of many single neurons in a network 
or the expression levels of many genes, and hence real 
experiments in these systems are more like Monte Carlo 
simulations, sampling the distribution of network states. 

Shannon proved that, given a probability distribution
over a set of variables, entropy is the unique measure
of what can be learned by observing these variables,
given certain simple and plausible criteria (continuity,
monotonicity and additivity) [?]. By the same arguments,
mutual information arises as the unique measure
of the interdependence of two variables, or two sets
of variables. Defining information theoretic analogs of
higher order correlations has proved to be more difficult [?].
When we compute $N$-point correlation functions in
statistical physics and field theory, we are careful to
isolate the connected correlations, which are the components
of the $N$-point correlation that cannot be factored into
correlations among groups of fewer than $N$ observables.
We propose here an analogous measure of "connected
information" which generalizes precisely our intuition about
connectedness and interactions from field theory; a closely
related discussion for quantum information has been given
recently [11].

Consider $N$ variables $\{x_i\}$, $i = 1, 2, \ldots, N$, drawn from
the joint probability distribution $P(\{x_i\})$; this has an
entropy

$$S(\{x_i\}) = -\sum_{\{x_i\}} P(\{x_i\}) \log P(\{x_i\}). \tag{1}$$



The fact that $N$ variables are correlated means that the
entropy $S(\{x_i\})$ is smaller than the sum of the entropies
for each variable individually,

$$S(\{x_i\}) < \sum_i S(x_i). \tag{2}$$

The total difference in entropy between the interacting
variables and the variables taken independently can be
written as [?],

$$I(\{x_i\}) = \sum_i S(x_i) - S(\{x_i\}) = \sum_{\{x_i\}} P(\{x_i\}) \log\left[\frac{P(\{x_i\})}{\prod_i P_i(x_i)}\right], \tag{3}$$

which is the Kullback-Leibler divergence between the
true distribution $P(\{x_i\})$ and the "independent" model
formed by taking the product of the marginals, $\prod_i P_i(x_i)$.
This has been called the multi-information; it provides
a general measure of non-independence among multiple
variables in a network.
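
The two forms of the multi-information in Eq. (3) can be checked numerically. The following is a minimal sketch, assuming a distribution is stored as a Python dictionary from state tuples to probabilities and that logarithms are base 2 (bits); the function names and the example numbers are illustrative and do not come from the text.

```python
import math
from collections import defaultdict

def entropy(p):
    """Shannon entropy, in bits, of a distribution stored as {state: probability}."""
    return -sum(q * math.log2(q) for q in p.values() if q > 0)

def marginal(p, idx):
    """Marginal over the variables at positions idx (states are tuples)."""
    m = defaultdict(float)
    for state, q in p.items():
        m[tuple(state[i] for i in idx)] += q
    return dict(m)

def multi_information(p):
    """Sum of single-variable entropies minus the joint entropy, as in Eq. (3)."""
    n = len(next(iter(p)))
    return sum(entropy(marginal(p, (i,))) for i in range(n)) - entropy(p)

def kl_to_independent(p):
    """Kullback-Leibler divergence from p to the product of its marginals."""
    n = len(next(iter(p)))
    singles = [marginal(p, (i,)) for i in range(n)]
    total = 0.0
    for state, q in p.items():
        if q > 0:
            prod = math.prod(singles[i][(state[i],)] for i in range(n))
            total += q * math.log2(q / prod)
    return total

# Hypothetical example distributions (made-up numbers for illustration).
independent = {(a, b): 0.25 for a in (0, 1) for b in (0, 1)}
copied = {(0, 0): 0.5, (1, 1): 0.5}
correlated = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}

for name, p in [("independent", independent), ("copied", copied), ("correlated", correlated)]:
    print(name, multi_information(p), kl_to_independent(p))
```

For two independent fair bits both expressions vanish, and for two identical fair bits both give one bit, as expected from the definition.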

The multi-information alone does not tell us how much
of the non-independence among $N$ variables is intrinsic
to the full $N$ variables and how much can be explained
from pairwise, triple, and higher order interactions. For
example, if the $x_i$'s are binary variables or equivalently
Ising spins $\sigma_i$, and if the full distribution $P(\{\sigma_i\})$ is a
conventional Ising model with pairwise exchange interactions,
then in an obvious sense there is nothing "new" to
learn by observing triplets of spins that can't be learned
by looking at all the pairs. On the other hand, if $\sigma_3$
is formed as the exclusive OR (XOR) of the variables
$\sigma_1$ and $\sigma_2$, then the essential structure of $P(\sigma_1, \sigma_2, \sigma_3)$
is contained in a three-spin interaction; if $\sigma_1$ and $\sigma_2$ are
chosen at random as inputs to the XOR, then all pairwise
mutual informations among the $\sigma_i$ will be zero, although
the multi-information will be one bit (Fig. 1).
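
The XOR statement can be verified directly. This is a small sketch under the assumption that the two inputs are independent fair bits and that the spins are coded as 0/1; the helper names are illustrative, not taken from the text.

```python
import math
from collections import defaultdict
from itertools import combinations, product

def entropy(p):
    """Shannon entropy, in bits, of a distribution stored as {state: probability}."""
    return -sum(q * math.log2(q) for q in p.values() if q > 0)

def marginal(p, idx):
    """Marginal over the variables at positions idx."""
    m = defaultdict(float)
    for state, q in p.items():
        m[tuple(state[i] for i in idx)] += q
    return dict(m)

def mutual_information(p, i, j):
    """I(x_i; x_j) = S(x_i) + S(x_j) - S(x_i, x_j)."""
    return (entropy(marginal(p, (i,))) + entropy(marginal(p, (j,)))
            - entropy(marginal(p, (i, j))))

# Joint distribution of (s1, s2, s3) with s3 = XOR(s1, s2) and random inputs.
xor = {(s1, s2, s1 ^ s2): 0.25 for s1, s2 in product((0, 1), repeat=2)}

pairwise = [mutual_information(xor, i, j) for i, j in combinations(range(3), 2)]
multi = sum(entropy(marginal(xor, (i,))) for i in range(3)) - entropy(xor)

print("pairwise mutual informations:", pairwise)
print("multi-information:", multi)
```

Every pair of variables is uniformly distributed over its four joint states, so each pairwise term is zero, while the joint distribution occupies only four of the eight states, which accounts for the one bit of multi-information.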

What we would like to do in our example of three variables
is to separate that component of $I(x_1; x_2; x_3)$ which
is expected from observations on pairs of variables from
that component which is intrinsic to the triplet. Observing
the variables in pairs means that we can construct
all of the pairwise marginals $P_{ij}(x_i, x_j) = \sum_{x_k} P(x_i, x_j, x_k)$.
Knowledge of these marginals provides (in general) a partial
characterization of the full probability distribution
$P(x_1, x_2, x_3)$. Following Jaynes [12] we can quantify this
knowledge by saying that the pairwise marginals set a
maximum value of the entropy for the full distribution.
More generally, if we have $N$ variables and we observe all
the subsets of $k$ elements, then there is a maximum entropy
for the distribution $P(\{x_i\})$ that is consistent with
all of the $k$-th order marginals. Let us denote this maximum
entropy distribution by $P^{(k)}(\{x_i\})$ and denote the
entropy of a probability distribution by $S[P]$; note that



$$P^{(1)}(\{x_i\}) = \prod_i P_i(x_i), \tag{4}$$



and that $P^{(N)}(\{x_i\})$ is just the true distribution $P(\{x_i\})$.
Then we can decompose the multi-information among 
the N variables into a sequence of terms: 



$$I(\{x_i\}) = S[P^{(1)}(\{x_i\})] - S[P^{(N)}(\{x_i\})] = \sum_{k=2}^{N} I_C^{(k)}(\{x_i\}), \tag{5}$$



where we define the connected information of order $k$,

$$I_C^{(k)}(\{x_i\}) = S[P^{(k-1)}(\{x_i\})] - S[P^{(k)}(\{x_i\})]. \tag{6}$$

The connected information of order $k$ is positive or zero;
it represents the amount by which the maximum possible
entropy of the system decreases when we go from knowing
only the marginals of order $k-1$ to knowing also
the marginals of order $k$. Each time that we increase the
number of elements that we can observe simultaneously
we uncover a potentially richer set of correlations, leading
to a reduction in the maximum possible entropy; the
connected information measures this entropy reduction.

Computing the connected information requires that we
construct the maximum entropy distributions consistent
with marginals of order $k$. In general this is a difficult
problem. Recall that when we maximize the entropy subject to
known expectation values of functions $f_\mu(\{x_i\})$, the
resulting probability distribution is of the Boltzmann form,
$P(\{x_i\}) \propto \exp[-\sum_\mu \lambda_\mu f_\mu(\{x_i\})]$, where the $\lambda_\mu$ are
Lagrange multipliers conjugate to each function $f_\mu$. We
can think of each marginal distribution as a set of
expectation values over the full distribution, so that we
need one Lagrange multiplier for each $k$-tuple of $x$ values.
The distribution $P^{(k)}$ thus has the form of a Boltzmann
distribution with $k$-body interactions; these interactions
are arbitrary functions which have to be determined by
matching the observed marginals. As an example, for
three variables with known pairwise marginals the maximum
entropy distribution takes the form

$$P^{(2)}(x_1, x_2, x_3) = \frac{1}{Z} \exp[-\lambda_{12}(x_1, x_2) - \lambda_{23}(x_2, x_3) - \lambda_{31}(x_3, x_1)]. \tag{7}$$

For a physical system that has at most $K$-body interactions
among the $N$ variables, $P^{(K)}$ will be the exact
distribution. Correspondingly, $I_C^{(k)} = 0$ for $k > K$.
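
One way to make Eqs. (6) and (7) concrete for small systems is iterative proportional fitting, a standard procedure that starts from the uniform distribution and repeatedly rescales it to match each observed marginal. The text does not prescribe a numerical method, so the sketch below is only one possible choice; it assumes binary variables coded as 0/1, entropies in bits, and a fixed number of sweeps (an arbitrary setting), and all function names are hypothetical.

```python
import math
from collections import defaultdict
from itertools import combinations, product

def entropy(p):
    """Shannon entropy, in bits, of a distribution stored as {state: probability}."""
    return -sum(q * math.log2(q) for q in p.values() if q > 0)

def marginal(p, idx):
    """Marginal over the variables at positions idx."""
    m = defaultdict(float)
    for state, q in p.items():
        m[tuple(state[i] for i in idx)] += q
    return m

def maxent(p, k, n, sweeps=500):
    """Approximate the maximum entropy distribution over n binary variables
    that matches all k-th order marginals of p (iterative proportional fitting)."""
    states = list(product((0, 1), repeat=n))
    q = {s: 1.0 / len(states) for s in states}  # start from the uniform distribution
    targets = {idx: marginal(p, idx) for idx in combinations(range(n), k)}
    for _ in range(sweeps):
        for idx, target in targets.items():
            current = marginal(q, idx)  # marginal of q before this rescaling step
            for s in states:
                key = tuple(s[i] for i in idx)
                if current[key] > 0:
                    q[s] *= target[key] / current[key]
    return q

def connected_information(p, k, n):
    """Eq. (6): entropy drop between matching order k-1 and order k marginals."""
    return entropy(maxent(p, k - 1, n)) - entropy(maxent(p, k, n))

# XOR of two random bits: expected to be a pure 3-body structure.
xor = {(a, b, a ^ b): 0.25 for a, b in product((0, 1), repeat=2)}

# Pairwise ferromagnet with an illustrative coupling beta = 0.5 (assumed value).
beta = 0.5
weights = {}
for s in product((0, 1), repeat=3):
    spins = [2 * b - 1 for b in s]
    energy = -sum(spins[i] * spins[j] for i, j in combinations(range(3), 2))
    weights[s] = math.exp(-beta * energy)
z = sum(weights.values())
fm = {s: w / z for s, w in weights.items()}

for name, p in [("XOR", xor), ("FM", fm)]:
    print(name, connected_information(p, 2, 3), connected_information(p, 3, 3))
```

In this sketch the XOR distribution has vanishing second order connected information and one bit at third order, while the pairwise ferromagnet has its third order term vanish to numerical precision, in line with the statement that $I_C^{(k)} = 0$ beyond the highest order of interaction present. The approach enumerates all states and so is practical only for small $N$.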

In general the functions $\lambda$ are difficult to determine
from the observed marginals, but this is not the case
for $k = 1$. This is a well known but important point:
the maximum entropy distribution consistent with one-body
marginals is just the product of the marginals, but
the maximum entropy distribution consistent even with
two-body (pairwise) marginals is not simply written in
terms of the marginals because the observed two-body
correlations include an average over interactions with all
other degrees of freedom. As a result, even the second
order maximum entropy distributions for $N$ variables are
not simply related to the pairwise marginals, and the
second order connected information is not simply related
to the mutual information among pairs of variables; $I_C^{(2)}$
is larger than the mutual information between any pair
of variables, but is not equal to their sum.

The fact that maximum entropy distributions have an
exponential form, and in the binary or Ising case this
form includes only a finite set of parameters, connects
our discussion with previous work. A number of authors
have used the maximum entropy distribution for families
of parameterized models as part of statistical tests for
the existence of higher order interactions [?]. In related
work, Amari [?] has constructed a geometry on the
parameter space for exponential families using the Fisher
information as a metric, and in this geometry the maximum
entropy distributions are orthogonal projections
onto subspaces of the full parametric space (see also [?]).
Rather than providing a parametric model of $k$-th order
interactions and determining a confidence level, the
set of $I_C^{(k)}$ provides a quantitative characterization of the
relative importance of various order interactions,
independent of parameterization.

As examples (Fig. 1), consider three binary or Ising
variables related either by boolean functions (AND, OR,
XOR) or coupled through pairwise ferromagnetic
interactions (FM). For these simple functions, we find that
the multi-information is composed of either pure 2-body
interactions or pure 3-body ones, as our intuition suggests.
When we add noise either to the input or output
of the boolean functions (Fig. 2) we degrade the correlations,
but more interestingly we find that pure 2-body
interactions such as AND and OR show a 3-body interaction
component for some types of noise (even for noise
sources which are state dependent). For the pure 3-body
XOR, noise may result in the appearance of 2-body
interactions. For these three functions, input noise only
changes the strength of the existing interactions, rather
than introducing a new kind of effective interaction.

FIG. 1: The values of multi-information, connected information
of orders 2 and 3, the pairwise mutual information and pairwise
redundancy for 3 binary variables, whose probability distribution
is given by the logical functions AND, OR and XOR (with the
inputs $\sigma_1$ and $\sigma_2$ chosen at random), and the case of ferromagnetic
interaction, FM.

[Figure 2 panels: horizontal axes $P(\text{flip } \sigma_3)$, $P(\text{flip } \sigma_1)$, $P(\text{flip } \sigma_3 \mid \sigma_1 = 1, \sigma_2 = 1)$; annotations "input noise" and "output noise".]

FIG. 2: Connected information of orders 2 and 3 and the
multi-information for 3 variables whose joint probability distribution is
given by noisy logical functions. Each panel presents the $I_C$'s and
$I$ values for a noisy version of one boolean gate (XOR in first row,
OR in second, AND in third), as a function of noise amplitude.
The three types of noise are output noise (probability of flipping
$\sigma_3$), input noise (probability of flipping $\sigma_1$) and input-dependent
output noise (probability of flipping $\sigma_3$, given that $\sigma_1 = 1$ and
$\sigma_2 = 1$).

As is familiar in physical examples, if we observe only
some of the elements of a network then the effect of the
hidden elements may be to create new effective interactions
among the observed elements. For example (Fig. 3),
when one hidden binary element determines the nature of a
pure pairwise interaction among the remaining elements,
the observable subnetwork can have an effective 3-body
interaction. Alternatively, for a network with only pure
3-body interactions, hidden elements can induce an
effective 2-body interaction among the observables.

[Figure 3 panels (a), (b), (c): horizontal axes $\gamma$, $\gamma$, $\beta$.]

FIG. 3: Connected information of orders 2 and 3 and the
multi-information, for networks of three binary observable elements,
$\sigma_1, \sigma_2, \sigma_3$, with one hidden binary element $\sigma_4$. (a) $I_C$'s and $I$ values
for a network where the value of $\sigma_4$ determines the pairwise
interaction between the other elements: if $\sigma_4 = 0$ then $\sigma_3 = \mathrm{AND}(\sigma_1, \sigma_2)$;
if $\sigma_4 = 1$ then the interaction among the observable variables is
(pairwise) ferromagnetic with a finite temperature ($\beta = 0.1$).
Information values are plotted as a function of $\gamma = P(\sigma_4 = 0)$.
(b) Same as (a), but for a $\sigma_4$-dependent mixture
of AND and OR. For this case there is no effective 3-body
interaction. (c) $I_C$'s and $I$ for the three observable binary variables
network, where the full 4 element network has pure 3-body
interactions, plotted as a function of the inverse temperature $\beta$.

As noted above, the connected information at second
order (for example) cannot be written simply in terms of
the mutual information among pairs of variables. Many
previous authors have looked for linear combinations of
mutual information measures which might provide measures
of higher order interaction, and among these one
approach is of particular interest [?]: If we
draw a Venn diagram of regions in the plane corresponding
to the variables $x_1, x_2, x_3$, identifying areas with
entropies, then the mutual information $I(x_i; x_j)$ between
two variables is the area of their intersection, and there
is a unique region shared by all three variables; with the
area-entropy correspondence the size of this "triplet
information" is

$$I_3 = \sum_i S(x_i) - \sum_{i<j} S(x_i, x_j) + S(x_1, x_2, x_3) = \sum_{i<j} I(x_i; x_j) - I(x_1; x_2; x_3). \tag{8}$$

This proposal for measuring a pure triplet information 
has natural generalizations to more than three variables. 

There are at least two difficulties with the triplet
information defined by $I_3$ (see a thorough discussion in
[?]). First, despite the identification of shared information
with areas in the plane, we find that $I_3$ can be negative
(AND, OR and XOR in Fig. 1). Second, $I_3$ can be
nonzero even for networks that have only pairwise
interactions (FM in Fig. 1).

Rather than "triplet information," $I_3$ actually compares [?]
the information that $x_1$ and $x_2$ together provide
about $x_3$ with the information that these two variables
provide separately:

$$I_3 = [I(x_1; x_3) + I(x_2; x_3)] - I(\{x_1, x_2\}; x_3). \tag{9}$$



This comparative measure of information is symmetric
under permutation of the indices, so the labeling of variables
as 1, 2, 3 is arbitrary. If $I_3$ is positive, then any pair
$x_i$ and $x_j$ are redundant in terms of the information that
they provide about the remaining $x_k$. If $I_3$ is negative,
then there is synergy: two variables taken together are
more informative than they are when taken separately.
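
The sign convention for $I_3$ can be illustrated with two extreme cases. The sketch below assumes binary variables coded as 0/1 and entropies in bits, and evaluates both the entropy form of Eq. (8) and the comparative form of Eq. (9); the function names are illustrative only.

```python
import math
from collections import defaultdict
from itertools import combinations, product

def entropy(p):
    """Shannon entropy, in bits, of a distribution stored as {state: probability}."""
    return -sum(q * math.log2(q) for q in p.values() if q > 0)

def marginal(p, idx):
    """Marginal over the variables at positions idx."""
    m = defaultdict(float)
    for state, q in p.items():
        m[tuple(state[i] for i in idx)] += q
    return dict(m)

def triplet_information(p):
    """Entropy form of I_3: sum S(x_i) - sum S(x_i, x_j) + S(x_1, x_2, x_3)."""
    singles = sum(entropy(marginal(p, (i,))) for i in range(3))
    pairs = sum(entropy(marginal(p, ij)) for ij in combinations(range(3), 2))
    return singles - pairs + entropy(p)

def comparative_form(p):
    """Eq. (9): [I(x1; x3) + I(x2; x3)] - I({x1, x2}; x3)."""
    def mi(a, b):
        return (entropy(marginal(p, a)) + entropy(marginal(p, b))
                - entropy(marginal(p, a + b)))
    return mi((0,), (2,)) + mi((1,), (2,)) - mi((0, 1), (2,))

# Synergistic case: x3 = XOR(x1, x2) with random inputs.
xor = {(a, b, a ^ b): 0.25 for a, b in product((0, 1), repeat=2)}
# Redundant case: three copies of a single fair bit.
copies = {(0, 0, 0): 0.5, (1, 1, 1): 0.5}

for name, p in [("XOR", xor), ("copies", copies)]:
    print(name, triplet_information(p), comparative_form(p))
```

The XOR case gives $I_3 = -1$ bit (synergy), while three identical copies give $I_3 = +1$ bit (redundancy), and the two forms agree in both cases.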

The question of synergy and redundancy brings us
back to one of the primary motivations for this analysis.
Consider the responses $x_1, x_2, \ldots, x_N$ of a collection
of elements to some stimulus $y$, for example a group
of neurons responding to a sensory stimulus. For each
neuron $i$ we can ask how much information the response
provides about the sensory world, $I(x_i; y)$. When we look
at a pair of neurons, we can ask whether these neurons
provide redundant or synergistic information (using Eq.
(9); see e.g. [?]). Similarly for a large population of
neurons we can compare the information in the population,
$I(\{x_i\}; y)$, with the sum of informations provided
by the neurons individually, $\sum_i I(x_i; y)$. This comparison,
however, does not tell us whether (for example) the
synergy in the population is the result of pairwise
correlations or whether there are special combinations of
responses across all three or more neurons which provide
extra information. The possible significance of such
multi-neuron combinatorial events has been discussed for
many years (see e.g. [?]).

We recall that the information provided by a population
of neurons can be written as

$$I(\{x_i\}; y) = S[P(\{x_i\})] - \langle S[P(\{x_i\}|y)] \rangle_y, \tag{10}$$

where $\langle \cdots \rangle_y$ denotes an average over the distribution of
sensory inputs. The redundancy of the population is defined as

$$R(\{x_i\}) = \sum_{i=1}^{N} I(x_i; y) - I(\{x_i\}; y), \tag{11}$$



where negative $R$ corresponds to synergy. We note that
$R$ can be written as the difference between two
multi-information terms,

$$R(\{x_i\}) = \left[ \sum_{i=1}^{N} S[P(x_i)] - S[P(\{x_i\})] \right] - \left\langle \sum_{i=1}^{N} S[P(x_i|y)] - S[P(\{x_i\}|y)] \right\rangle_y. \tag{12}$$

The first term is the multi-information in the distribution
of neural responses, which measures the extent to
which the total "vocabulary" of the population is reduced
through correlations, while the second term is the
multi-information in the distribution of responses to a given
stimulus. Each of these terms in turn can be expanded
as a sum of connected informations, so that

$$R(\{x_i\}) = \sum_{k=2}^{N} \left[ I_C^{(k)}(\{x_i\}) - \langle I_C^{(k)}(\{x_i\}|y) \rangle_y \right], \tag{13}$$

where $I_C^{(k)}(\{x_i\}|y)$ is the connected information of order
$k$ in the network of $\{x_i\}$, for a given value of $y$. By
analogy with the discussion of synergy in pairs [16], the
terms $I_C^{(k)}(\{x_i\})$ quantify the contribution of $k$th order
interactions to restricting the vocabulary of the population
response (much as not all $k$ letter combinations form
words in English), while the terms $\langle I_C^{(k)}(\{x_i\}|y) \rangle_y$ quantify
the contribution of $k$th order correlations to reducing
the noise in the population response.
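
The equivalence between the definition of redundancy in Eq. (11) and the difference of multi-informations in Eq. (12) can be checked on toy populations. This is a minimal sketch assuming two binary "neurons" and a binary stimulus stored as the last entry of each state tuple; the two example populations and all function names are hypothetical.

```python
import math
from collections import defaultdict
from itertools import product

def entropy(p):
    """Shannon entropy, in bits, of a distribution stored as {state: probability}."""
    return -sum(q * math.log2(q) for q in p.values() if q > 0)

def marginal(p, idx):
    """Marginal over the variables at positions idx."""
    m = defaultdict(float)
    for state, q in p.items():
        m[tuple(state[i] for i in idx)] += q
    return dict(m)

def multi_information(p):
    """Sum of single-variable entropies minus the joint entropy."""
    n = len(next(iter(p)))
    return sum(entropy(marginal(p, (i,))) for i in range(n)) - entropy(p)

def redundancy_eq11(p, n):
    """Eq. (11): sum of single-response informations minus the population information.
    States of p are (x_1, ..., x_n, y)."""
    def mi(a, b):
        return (entropy(marginal(p, a)) + entropy(marginal(p, b))
                - entropy(marginal(p, a + b)))
    xs, y = tuple(range(n)), (n,)
    return sum(mi((i,), y) for i in range(n)) - mi(xs, y)

def redundancy_eq12(p, n):
    """Eq. (12): multi-information of the responses minus the stimulus-averaged
    multi-information of the responses at fixed y."""
    xs = tuple(range(n))
    averaged = 0.0
    for (y_value,), weight in marginal(p, (n,)).items():
        if weight > 0:
            conditional = {s[:n]: q / weight for s, q in p.items() if s[n] == y_value}
            averaged += weight * multi_information(conditional)
    return multi_information(marginal(p, xs)) - averaged

# Synergistic toy population: the stimulus is the XOR of two independent responses.
synergy = {(a, b, a ^ b): 0.25 for a, b in product((0, 1), repeat=2)}
# Redundant toy population: both responses copy a fair binary stimulus.
redundant = {(0, 0, 0): 0.5, (1, 1, 1): 0.5}

for name, p in [("synergy", synergy), ("redundant", redundant)]:
    print(name, redundancy_eq11(p, 2), redundancy_eq12(p, 2))
```

In the first toy case the responses are independent overall but perfectly correlated at fixed stimulus, giving $R = -1$ bit; in the second the correlations restrict the vocabulary without reducing noise, giving $R = +1$ bit.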

To summarize, the maximum entropy construction of
connected information presented here provides us with
a method both for decomposing the correlations within
a network and for quantifying the contribution of these
correlations to the information that network states can
provide about external signals. Since any part of a network
can be thought of as 'external' to its complement,
this unified discussion of internal correlations and the
representation of external signals is attractive.